Semantic Query Reformulation in Social PDMS 



Angela Bonifati 

LIFL, University of Lille 1 
Cité Scientifique, Lille (France) 

Esther Pacitti 

LIRMM, University of Montpellier II 
Rue Ada, Montpellier (France) 



Fady Draidi 

LIRMM, University of Montpellier II 
Rue Ada, Montpellier (France) 

Gianvito Summa 

University of Basilicata 
Via dell'Ateneo Lucano, Potenza (Italy) 



November 28, 2011 



Abstract 



We consider social peer-to-peer data management systems (PDMS), where
each peer maintains both semantic mappings between its schema and the schemas of some
acquaintances, and social links with peer friends. In this context, reformulating a
query from a peer's schema into other peers' schemas is a hard problem, as it may
generate as many rewritings as the set of mappings from that peer to the outside
and transitively on, by eventually traversing the entire network. However, not all
the obtained rewritings are relevant to a given query. In this paper, we address this
problem by inspecting semantic mappings and social links to find only relevant
rewritings. We propose a new notion of 'relevance' of a query with respect to a
mapping, and, based on this notion, a new semantic query reformulation approach
for social PDMS, which achieves great accuracy and flexibility. To rapidly find
the most interesting mappings, we combine several techniques: (i) social links
are expressed as FOAF (Friend of a Friend) links to characterize peers' friendship,
and compact mapping summaries are used to obtain mapping descriptions;
(ii) local semantic views are special views that contain information about external
mappings; and (iii) gossiping techniques improve the search of relevant mappings.
Our experimental evaluation, based on a prototype on top of PeerSim and a simulated
network, demonstrates that our solution yields greater recall compared to
traditional query translation approaches proposed in the literature.

1 Introduction



In the last decade, we have witnessed a dramatic shift in the scale of distributed and 
heterogeneous databases: they have become larger, more dispersed and semantically 
interconnected networks of peers, exhibiting varied schemas and instances. This shift 
in scale has forced us to revisit the assumptions underlying distributed databases and 
consider peer-to-peer (P2P) systems. A P2P data management system (PDMS) [14] 
is an ad-hoc collection of independent peers that have formed a network in order to
map and share their data. For example, consider an online scientific community¹ that
uses an underlying P2P infrastructure for data sharing. In particular, each peer embodies
a medical doctor or a physician, who enters the community to share her clinical
cases (yet hiding sensitive patient record data) with a subset of her colleagues, and get 
knowledgeable opinions from them. 

Peers in such an example typically have heterogeneous schemas, with no mediated
or centralized schema. Still, to process a query over the PDMS, the data needs to be
translated from one peer's schema to another peer's schema. To address this problem,
a PDMS maintains a set of mappings or correspondences between a peer schema and a
sufficiently small number of other peer schemas, called acquaintances. The mappings
between the local schema and the acquaintance schema can be manually provided, or, 
alternatively, computed via an external schema matching tool. 

Each doctor likes to exchange specific data about treatments and patients with the
peers she trusts and/or is friends with. Additionally, she may not find the information
within her set of acquaintances, and may need to look for colleagues she has never
met before.

In order to cope with data heterogeneity in PDMS, queries are formulated against 
a local peer schema, and translated against each schema of the peer acquaintances, 
and transitively on. This problem, called query reformulation, has been addressed
in the literature by schema mapping tools [23, 25, 6], and proved to be effective in
PDMS [15]. However, a fundamental limitation of the above tools is that query
translation is essentially enacted on every peer by tracking all the mappings, whereas in
a realistic scenario, only semantically relevant mappings should be exploited. E.g., in our
online community, each doctor would like to exchange specific data about treatments 
and patients only with the peers that provide relevant information (members of the 
same lab or former university mates), rather than with every peer. Similarly, she may 
be willing to know who else, among the doctors in her community, or among her friend 
doctors, has worked on similar cases. 

As the above example (typical of professional social networking) suggests, social 
relationships (or friendships) between community members are also crucial to locate 
relevant information. Similarly, in order to identify relevant mappings, we exploit 
friendship links between peers, in addition to acquaintances, in what we call social 
PDMS. As in social networks, by establishing a friendship link, a peer \(p_i\) can become
a friend of a peer \(p_j\) and share peer information. In our case, the peer information
we are interested in is local semantic mappings. They express meaningful semantic
relationships between elements in heterogeneous schemas on different peers. Local
mappings can be numerous. For this reason, and for efficient storage and retrieval
at peers, local mappings may need to be summarized as Bloom filters. In addition,
to capture peer friendship, we adopt the Friend of a Friend (FOAF) vocabulary [4].
FOAF provides an open, detailed description of users' profiles and their relationships
using RDF syntax. We have adapted the FOAF files to PDMS and extended the FOAF
syntax to also point to the mapping summaries of a peer's friends.

In this paper, we tackle the problem of query reformulation for conjunctive queries
in social PDMS. Based on a new notion of relevance of a query with respect to a
mapping, we propose a semantic query reformulation approach, using both semantic
mappings and friendship links, thus biasing the query translation only towards relevant
peers.

¹ This example has been inspired by a web-based online trusted physician network,
https://www.ozmosis.com/home, 'where good doctors go to become great doctors'.

To precisely define the notion of relevance of a query with respect to a mapping,
we propose a novel metric called the AF-IMF measure, which takes into account the
semantic proximity between the query and the local and external mappings. However,
the above metric would need to be computed in a distributed fashion, which would
require contacting every peer in the network. To address this difficulty, we store on each peer
a local semantic view that offers a synthetic description of the mapping components
of external peers. To further improve the search of relevant mappings, we combine the
above local semantic views with gossiping techniques [18]. These techniques refer to
the probabilistic exchange of mappings between two peers, through an endless
process in which random pairs of peers communicate with each other. We adapt
gossiping to our context by periodically refreshing the local semantic view on each
peer, based on gossiped atoms; by means of such semantic views, promising semantic
paths can be followed in the network, such that, for a given query, the most relevant
mappings can be located and/or the most relevant peer friends can be reached.
Contributions. We make the following main contributions:

(i) We propose a novel notion of relevance of a query with respect to a mapping, along
with that of a relevant rewriting; we characterize each mapping in the entire collection
of mappings present in the network with a new metric, the AF-IMF measure, which
precisely identifies the most interesting mappings, towards which query translation
should be directed.

(ii) We propose query reformulation algorithms that, given an input query Q and
a set of mappings between peers' schemas, do the following: translate the query into
\(Q^t\) only against the relevant mappings by adopting our new evaluation metric;
exploit friendship links among peers to possibly enlarge the set of mappings and bias the
search towards interesting peers; exploit semantic gossiping to discover new relevant
mappings and friends, thus increasing the number of query rewritings. To the best
of our knowledge, these algorithms advance the state of the art of query reformulation in
PDMS (more details in Section 6).

(iii) We provide an extensive experimental evaluation by running our algorithms
on a simulated network built on top of PeerSim, which demonstrates that our solution
yields greater recall, compared to traditional query translation approaches.

The paper is organized as follows. Section 2 presents the background and the prob- 
lem definition. Section 3 introduces our framework, while Section 4 and Section 5 
describe our algorithms and the experimental assessment that has been conducted. Fi- 
nally, Section 6 discusses the related work and Section 7 concludes the paper. 

2 Problem Definition 

In this section, we first present the background of schema mappings and P2P networks,
and then we detail the problem statement.

[Figure 1, target schema: HealthCareInst (name, id); Grant (amount, scientist, institute)]

Figure 1: A Schema Mapping Example



2.1 Schema Mapping Model 

Data exchange systems [10] rely on dependencies to specify mappings. Given two
schemas, S and T, a source-to-target tuple-generating dependency (also called an s-t
tgd or, equivalently, a tgd) is a first-order formula of the form
\(\forall \bar{x}\,(\phi(\bar{x}) \rightarrow \exists \bar{y}\,\psi(\bar{x}, \bar{y}))\),
where \(\bar{x}\) and \(\bar{y}\) are vectors of variables, \(\bar{x}\) are universally quantified variables and \(\bar{y}\) are
existentially quantified variables. The body \(\phi\) is a conjunctive query (CQ) over S and
the head \(\psi\) is a CQ over T.

Example 1 Consider Figure 1 in which two schemas describing two scientists' local
data are depicted. A set of correspondences \(v_1\), \(v_2\) and \(v_3\) connects elements in the two
schemas.

We report below two examples of s-t tgds for the two schemas above:
Source-to-Target Tgds

\[
\begin{aligned}
m_1.\ & \forall n, l:\ \text{Hospital}(n, l) \rightarrow \exists I:\ \text{HealthCareInstitution}(n, I) \\
m_2.\ & \forall n, s, a, \mathit{pi}, l:\ \text{Doctor}(n, s) \wedge \text{Grant}(a, \mathit{pi}, n) \wedge \text{Hospital}(\mathit{pi}, l) \\
      & \quad \rightarrow \exists I:\ \text{HealthCareInstitution}(\mathit{pi}, I) \wedge \text{Grant}(a, n, I)
\end{aligned}
\]

A schema mapping is a triple \(\mathcal{M} = (S, T, \mu_{st})\) (\(\mathcal{M}_{st}\), in short), where S is a source
schema, T is a target schema, and \(\mu_{st}\) is a set of source-to-target tgds. If I is an instance
of S and J is an instance of T, then the pair (I, J) is an instance of (S, T). A target
instance J is a solution of \(\mathcal{M}\) and a source instance I (denoted \(J \in Sol(\mathcal{M}, I)\)) iff
\((I, J) \models \mu_{st}\), i.e., I and J together satisfy the dependencies.

We distinguish between specific forms of s-t tgds, which are GAV (global-as-view)
and LAV (local-as-view). A GAV tgd is a formula \(\forall \bar{x}\,(\phi(\bar{x}) \rightarrow A(\bar{x}))\), where the head
is a single atom \(A(\bar{x})\). Similarly, a LAV tgd is a formula \(\forall \bar{x}\,(A(\bar{x}) \rightarrow \exists \bar{y}\,\psi(\bar{x}, \bar{y}))\),
where the body is a single atom \(A(\bar{x})\). GAV tgds are special cases of more general
tgds, called GLAV, that contain conjunctions of atoms and existential variables in the
head. Here and henceforth, we focus on GLAV mappings, thus expressed by means of
GLAV s-t tgds in \(\mu_{st}\), as the others represent more restrictive cases. We denote such
GLAV mappings with \(\mathcal{M}\).
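The distinction between GAV, LAV and GLAV tgds can be made concrete with a small data model. The following is a minimal sketch, not the paper's implementation: the class names `Atom` and `Tgd` and the function `classify` are hypothetical, and variables are represented simply as strings. It encodes the two rules of Example 1 and reports the most specific class each one falls into.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Atom:
    label: str                # relation name, e.g. "Hospital"
    params: Tuple[str, ...]   # ordered variable names


@dataclass(frozen=True)
class Tgd:
    body: Tuple[Atom, ...]    # phi: conjunction of atoms over the source schema
    head: Tuple[Atom, ...]    # psi: conjunction of atoms over the target schema

    def universal_vars(self) -> FrozenSet[str]:
        # Variables occurring in the body are universally quantified.
        return frozenset(v for a in self.body for v in a.params)

    def existential_vars(self) -> FrozenSet[str]:
        # Head variables that do not occur in the body are existential.
        head_vars = {v for a in self.head for v in a.params}
        return frozenset(head_vars - self.universal_vars())


def classify(tgd: Tgd) -> str:
    """Return the first matching class among GAV, LAV and GLAV."""
    if len(tgd.head) == 1 and not tgd.existential_vars():
        return "GAV"          # single head atom, no existential variables
    if len(tgd.body) == 1:
        return "LAV"          # single body atom
    return "GLAV"


# The two rules of Example 1 (I is the existential variable).
m1 = Tgd(body=(Atom("Hospital", ("n", "l")),),
         head=(Atom("HealthCareInstitution", ("n", "I")),))
m2 = Tgd(body=(Atom("Doctor", ("n", "s")),
               Atom("Grant", ("a", "pi", "n")),
               Atom("Hospital", ("pi", "l"))),
         head=(Atom("HealthCareInstitution", ("pi", "I")),
               Atom("Grant", ("a", "n", "I"))))

print(classify(m1), classify(m2))
```

Under these assumptions, the first rule has a single body atom and is therefore also a LAV tgd, while the second requires the full GLAV form.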



[Figure 1, source schema: Hospital (name, location); Doctor (name, salary)]

Example 2 (cont'd) Continuing with the above example, a schema mapping is a triple
\(\mathcal{M} = (S, T, \mu_{st})\), where S is the source schema, T is the target schema and
\(\mu_{st} = \{m_1, m_2\}\). Moreover, \(\mathcal{M}\) is a GLAV mapping.

We assume as customary that mappings among schemas are either provided by the 
users or by using external schema mapping tools. 

We build on prior work [25, 6] to define the semantics of query translation. Precisely,
we denote with inst(S) (inst(T)) the set of instances I (instances J, respectively).

Definition 1 (Semantics of query translation) Suppose \(Q_i\) is a query posed against
S, and \(Q_j\) is a query posed against T, \(j \neq i\). Let \(Q_i^t\) denote a translation of \(Q_i\) against
T and \(Q_j^t\) denote a translation of \(Q_j\) against S. Then, \(Q_j^t\) is correct provided
\[ \forall D_s \in inst(S):\ Q_j^t(D_s) = Q_j(\mathcal{M}(D_s)). \]
The translation \(Q_i^t\) is correct provided
\[ \forall D_t \in inst(T):\ Q_i^t(D_t) = \bigcap_{D_s^k \,:\, \mathcal{M}(D_s^k) = D_t} Q_i(D_s^k). \]

In other words, the translation \(Q_j^t\) is correct provided \(Q_j\) applied to the transformed
instance \(\mathcal{M}(D_s)\) and \(Q_j^t\) applied to \(D_s\) both yield the same results, for all
\(D_s \in inst(S)\). Note that in this case, the direction of translation is against that of
the mapping \(\mathcal{M}\). As in [6], we henceforth call it backward query translation. This
direction of translation is similar to view expansion, with \(\mathcal{M}\) being the view definition.
Translating a query \(Q_i\) posed against S to the schema T of peer \(p_j\) is aligned with the
direction of the mapping \(\mathcal{M}\), and represents the forward query translation [6]. The forward
direction is trickier, as the mapping \(\mathcal{M}\) may not be invertible. In fact, there
are two alternative strategies to make sense of this direction of translation: (i) obtaining
the reverse schema mapping \(\mathcal{M}^{-1}\) [9, 3], such that the query rewriting semantics is the
same as the backward direction; this strategy applies to the case in which one would
like to recover the exchanged data, i.e. to find the source instance I from which the
target instance J has been derived; (ii) focusing on the computation of a rewriting of a
conjunctive query \(Q_i\) over the source schema, assuming that a source instance I (\(D_s\))
is already available and adopting the semantics based on certain answers of all possible
pre-images; in such a case, it is possible to reuse the work done in the area of query
answering using views for data integration [21, 20].
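The backward condition of Definition 1 can be checked mechanically on sample data. The sketch below is only an illustration under stated assumptions: it uses the rule Hospital(n, l) → ∃I: HealthCareInstitution(n, I) of Example 1, fills the existential variable with labeled nulls, and tests the equality on a finite set of hypothetical source instances, whereas the definition quantifies over all instances. All names (`apply_mapping`, `q_target`, `q_translated`, the sample tuples) are invented for the example.

```python
from itertools import combinations

# Hypothetical Hospital(name, location) tuples used to build source instances.
HOSPITALS = [("Mercy", "San Francisco"), ("Central", "Lille"), ("Ada", "Montpellier")]


def apply_mapping(source):
    """Exchange data along Hospital(n, l) -> exists I: HealthCareInstitution(n, I).

    The existential variable I is filled with a fresh labeled null per tuple.
    """
    return {(name, f"_N{k}") for k, (name, _loc) in enumerate(sorted(source))}


def q_target(target):
    """Q_j, posed against T: names of all health-care institutions."""
    return {name for name, _ident in target}


def q_translated(source):
    """Candidate backward translation of Q_j, posed against S: names of all hospitals."""
    return {name for name, _loc in source}


def is_correct_backward(instances):
    """Check Q_j^t(D_s) == Q_j(M(D_s)) on every sample instance D_s."""
    return all(q_translated(d) == q_target(apply_mapping(d)) for d in instances)


# All sub-instances of the sample, including the empty one.
samples = [set(c) for r in range(len(HOSPITALS) + 1)
           for c in combinations(HOSPITALS, r)]
print(is_correct_backward(samples))
```

A check of this kind can only refute a candidate translation; it does not replace the correctness argument given in [6].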

Moreover, observe that in our setting we focus on query answering rather than 
on data exchange and on materializing a target instance. In fact, by following the 
semantics given in [11, 6], we adopt the second strategy (ii), that lets us translate 
the query rather than the data and lets us realize query rewriting along the mappings. 
This strategy is more natural in a P2P setting in which we do not need to reverse the 
mappings, and lets us avoid bringing the exchanged data back to the peers. 

2.2 Network Model 

We assume a heterogeneous network of peers \(p_1, \ldots, p_n\), each peer having a distinct
relational² schema \(S_1, \ldots, S_n\). Let \(\mathcal{M}_{ij}\) be a generic GLAV mapping between a pair
of schemas \(S_i\), \(S_j\), from peer \(p_i\) to peer \(p_j\). We assume that each peer has only one
local schema, which may contain key/foreign key constraints, along with data defined
according to the schema itself. However, for simplicity we ignore the above constraints
in the query translation process.

² The extension to nested relational schemas is beyond the scope of our paper and will be addressed as
future work.

Given a mapping \(\mathcal{M}\) from a peer \(p_i\) to a peer \(p_j\), which we denote with \(\mathcal{M}_{ij}\), \(\mathcal{M}_{ij}\)
is also called an outward mapping for \(p_i\). The peer \(p_j\) is also called the target peer
for this mapping. Conversely, a mapping \(\mathcal{M}_{ji}\) from peer \(p_j\) to \(p_i\) is called an inward
mapping for \(p_i\) (outward for \(p_j\), resp.). Similarly, \(p_i\) is the target peer in such a case.

We do not assume a symmetric distribution of the mappings, i.e. for a mapping
\(\mathcal{M}_{ij}\), we expect that either \(p_i\) or \(p_j\) stores the mapping, or both of them. We have
designed ad-hoc data structures to store mappings on each peer. Details on such data
structures will be provided in Section 3.2.2. As customary, the network has a dynamic
behavior, meaning that any peer \(p_i\) can join or leave the network arbitrarily.
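The asymmetric storage of mappings and the inward/outward terminology can be illustrated as follows. This is a hypothetical sketch and not the data structure of Section 3.2.2: the `Peer` class, its methods and the peer names are assumptions made for the example, and mapping rules are reduced to string identifiers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Peer:
    name: str
    # Locally stored mappings, keyed by (source peer, target peer).
    mappings: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)

    def store(self, source: str, target: str, rules: List[str]) -> None:
        # A peer only keeps mappings in which it is the source or the target.
        if self.name not in (source, target):
            raise ValueError("a peer only stores mappings it participates in")
        self.mappings[(source, target)] = rules

    def outward(self) -> List[Tuple[str, str]]:
        """Mappings whose source is this peer."""
        return [k for k in self.mappings if k[0] == self.name]

    def inward(self) -> List[Tuple[str, str]]:
        """Mappings whose target is this peer."""
        return [k for k in self.mappings if k[1] == self.name]


p_i, p_j = Peer("p_i"), Peer("p_j")
p_i.store("p_i", "p_j", ["m1", "m2"])   # M_ij is kept at p_i only
p_i.store("p_k", "p_i", ["m3"])         # M_ki is kept at p_i as well

print(p_i.outward())   # [('p_i', 'p_j')]
print(p_i.inward())    # [('p_k', 'p_i')]
print(p_j.mappings)    # {} : p_j does not hold a copy of M_ij
```

Here the mapping from p_i to p_j is known to p_i but not to p_j, which is one of the situations the non-symmetric assumption allows.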

2.3 Problem Statement 

Without loss of generality, we consider conjunctive queries (CQs), that are expressed
as conjunctions of atoms \(a_1, \ldots, a_n\), and mappings \(\mathcal{M}\) composed by mapping rules
having one or more atoms in the body \(\phi\) (in the head \(\psi\), respectively).

We are given an input query \(Q_i\) formulated at a peer in the network against an arbitrary
schema \(S_i\), a direct outward mapping \(\mathcal{M}_{ij} = S_i \rightarrow S_j\) (from \(S_i\) to \(S_j\)) and a direct
inward mapping \(\mathcal{M}_{ki} = S_k \rightarrow S_i\) (from \(S_k\) to \(S_i\)). In addition, transitively from
\(S_j\) (resp. \(S_k\)), we are given any other reachable schema \(S_l\) (resp. \(S_m\)) for which there
exists, without loss of generality, at least an outward mapping \(\mathcal{M}_{jl}\) (resp. an inward
mapping \(\mathcal{M}_{mk}\)), and so on, continuing from \(S_l\) and \(S_m\) to any other reachable schema
through inward or outward mappings.

Then, the problem can be stated as: 

• finding the relevant rewritings of \(Q_i\) along and against the direction of the
mappings \(\mathcal{M}_{ij}\) (\(\mathcal{M}_{ki}\), resp.) and \(\mathcal{M}_{jl}\) (\(\mathcal{M}_{mk}\), resp.) and so on, by following the
mappings which connect the schemas. All the relevant rewritings have to be
computed by avoiding useless mapping paths from \(S_i\) to \(S_j\), from \(S_k\) to \(S_i\), from
\(S_j\) to \(S_l\), from \(S_m\) to \(S_k\) and so on, from any reached schema to any reachable
schema connected by mappings.

Notice that the propagation of the input query Qi to all peers in the network leads 
to collect as many rewritings as possible for that query. In fact, the input query Qi can 
be certainly evaluated on the originating peer that hosts the schema Si (upon which 
the query itself has been formulated) but may not be pertinent for all the schemas of 
other peers, unless relevant rewritings can be located. Moreover, the chosen strategy 
by which the results of the rewritten queries are conveyed towards the originating peer 
is a simple one, i.e. the results are unioned and possible duplicates are discarded. 
Alternatively, mapping composition could have been used here, but it falls beyond the 
scope of this paper. 

The rewritings of \(Q_i\) follow the semantics given in Definition 1, whose correctness
is proved in [6]. In this paper, we propose a query rewriting strategy different from the
ones used in previous work [15, 6] in which all possible translations are pursued, since
we only exploit relevant translations. To this purpose, Section 3 introduces the notion
of relevance of a query with respect to a mapping, and that of a relevant rewriting.

3 A Framework for query reformulation 

In this section, we develop a novel framework for semantic query reformulation 
in social PDMS. This framework relies on several contributions: a precise definition 
of relevance of a query wrt. a mapping; a new metric (AF-IMF) for computing such 
relevance and its supporting data structures; and a distributed method for computing 
AF-IMF in a P2P network. 

3.1 Relevance of a query wrt. a mapping 

To define such relevance, we consider a schema mapping scenario \(\mathcal{M} = (S, T, \mu_{st})\),
where S is a source schema, T is a target schema, and \(\mu_{st}\) is a set of source-to-target tgds
that express the GLAV mapping.

Let \(m \in \mu_{st}\) be an s-t tgd³ of the form \(\forall \bar{x}\,(\phi(\bar{x}) \rightarrow \exists \bar{y}\,\psi(\bar{x}, \bar{y}))\), with \(\phi\) and \(\psi\) as
CQs containing the atoms \(a_1(X_1), \ldots, a_n(X_n)\), with each \(X_i\) being an ordered
set of parameters \((x_1, x_2, \ldots, x_m)\), and each parameter being a variable $x_i.

Let Q be a CQ containing the atoms \(a_1(X_1), \ldots, a_n(X_n)\), with each \(X_i\) being an
ordered set of parameters \((x_1, x_2, \ldots, x_m)\), and each parameter being a constant value
x_i or a variable $x_i.

An atom \(a_i(X_i)\) in \(\phi\) (\(\psi\), resp.) is unifiable with a query atom \(a_j(X_j)\) in Q if
a unifying substitution of variables and constant symbols exists. More precisely, a
unification occurs if:

• (i) \(label(a_i) = label(a_j)\), i.e. both atoms have the same name;

• (ii) \(\forall x_i \in X_i\), $x_i matches the variable symbol $x_j \(\wedge\ (i = j)\) or $x_i matches the
constant symbol x_j \(\wedge\ (i = j)\), i.e. parameters are matched position by position.

In other words, each query atom \(a_j(X_j)\) must match an atom in the body (head,
resp.) of a tgd with both its label and its ordered set of parameters \((x_1, \ldots, x_m)\). Such
a match follows the rules for atom unification in Datalog (i.e. constant and variable
unification).
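The two unification conditions can be sketched as a small function. This is an illustrative sketch under assumptions, not the authors' code: atoms are encoded as (label, parameters) pairs, variables are strings starting with `$`, and a rule variable that occurs twice in the same atom must be matched by identical query terms, which is a conservative simplification of full Datalog unification.

```python
from typing import Dict, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (label, ordered parameters)


def is_var(term: str) -> bool:
    return term.startswith("$")


def unify(rule_atom: Atom, query_atom: Atom) -> Optional[Dict[str, str]]:
    """Return a substitution for the rule atom's variables, or None if none exists."""
    (r_label, r_params), (q_label, q_params) = rule_atom, query_atom
    # Condition (i): same label. Parameters are then compared position by position.
    if r_label != q_label or len(r_params) != len(q_params):
        return None
    subst: Dict[str, str] = {}
    for r, q in zip(r_params, q_params):
        if is_var(r):
            # Condition (ii): a rule variable matches the query variable or
            # constant found at the same position, consistently.
            if subst.setdefault(r, q) != q:
                return None
        elif is_var(q):
            continue          # a query variable can be bound to a rule constant
        elif r != q:
            return None       # two different constants never unify
    return subst


hospital_in_m1 = ("Hospital", ("$n", "$l"))
doctor_in_m2 = ("Doctor", ("$n", "$s"))
query_atom = ("Hospital", ("$x", "SanFrancisco"))

print(unify(hospital_in_m1, query_atom))   # {'$n': '$x', '$l': 'SanFrancisco'}
print(unify(doctor_in_m2, query_atom))     # None
```

Since the tgds considered here contain only variables, the check mostly reduces to comparing labels and arities; the constant branches are kept for completeness.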

Example 3 Consider again Figure 1 and the mapping rules \(m_1\) and \(m_2\) specified in
Section 2.

Consider the query Q = Hospital($x, 'SanFrancisco'). Posed against the source schema
of Figure 1, this query returns the names of all hospitals in
San Francisco. Q consists of only one atom (Hospital) which has two parameters, a
variable (i.e. $x) plus a constant value (i.e. 'SanFrancisco').

We can now define the relevance of a query with respect to a mapping rule, as
follows.

³ Notice that here and henceforth we use mapping rule and s-t tgd as synonyms.

Definition 2 (Relevance Forward) Given a schema mapping \(\mathcal{M}_{ij}\) that maps elements
of the schema \(S_i\) into elements of \(S_j\), let m be a mapping rule in \(\mu_{ij}\) and let \(A_i\) be the
set of atoms of m in the body. A query Q posed against \(S_i\) along the direction of the
mapping rule is relevant to m if \(\forall a_q\) of Q, \(a_q \in A_i\), i.e. each atom of Q is unifiable
with an atom of \(A_i\).

Definition 3 (Relevance Backward) Given a schema mapping \(\mathcal{M}_{ij}\) that maps elements
of the schema \(S_i\) into elements of \(S_j\), let m be a mapping rule in \(\mu_{ij}\) and let
\(A_j = \{a_j(X_j)\}\) be the set of atoms of m in the head, such that \(a_j(X_j) \in A_j\) if it
only contains universally quantified variables. A query Q posed against \(S_j\) against the
direction of the mapping rule is relevant to m if \(\forall a_q\) of Q, \(a_q \in A_j\), i.e. each atom of
Q is unifiable with an atom of \(A_j\).

Consequently, we can now define the relevance of a query wrt. the whole mapping, 
as follows. 

Definition 4 (Mapping Relevance) Let \(\mathcal{M} = (S, T, \mu_{st})\) be a mapping, where S is a
source schema, T is a target schema and \(\mu_{st}\) is the set of s-t tgds, and let \(m \in \mu_{st}\) be
a mapping rule of such mapping. A query Q posed against the mapping \(\mathcal{M}\) is relevant
if there exists at least one mapping rule \(m \in \mu_{st}\) such that Q is forward or backward
relevant to m.

Example 4 (cont'd) Continuing with the example above, shown in Figure 1, it is easy
to check that the query Q above is forward relevant to both \(m_1\) and \(m_2\), according
to the above definition. If we consider a query Q' = Grant($x, $y, $z) and a query
Q'' = HealthCareInstitution($y, $z), neither Q' nor Q'' is backward relevant to
either mapping rule.
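Definitions 2 to 4 translate almost directly into code. The sketch below is a hedged illustration rather than the paper's algorithm: the `Rule` type and the function names are hypothetical, and unifiability is simplified to equal label and equal arity, which is adequate here because the tgd atoms contain only distinct variables. It replays the checks of Example 4 on the rules of Example 1.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (label, ordered parameters)


class Rule(NamedTuple):
    body: List[Atom]
    head: List[Atom]
    existential: FrozenSet[str]      # existentially quantified head variables


def unifiable(rule_atom: Atom, query_atom: Atom) -> bool:
    # Simplifying assumption: same label and same number of parameters.
    return rule_atom[0] == query_atom[0] and len(rule_atom[1]) == len(query_atom[1])


def forward_relevant(query: List[Atom], rule: Rule) -> bool:
    """Definition 2: every query atom unifies with some body atom."""
    return all(any(unifiable(a, q) for a in rule.body) for q in query)


def backward_relevant(query: List[Atom], rule: Rule) -> bool:
    """Definition 3: only head atoms free of existential variables are usable."""
    usable = [a for a in rule.head if not rule.existential & set(a[1])]
    return all(any(unifiable(a, q) for a in usable) for q in query)


def mapping_relevant(query: List[Atom], rules: List[Rule]) -> bool:
    """Definition 4: some rule is forward or backward relevant."""
    return any(forward_relevant(query, m) or backward_relevant(query, m) for m in rules)


m1 = Rule(body=[("Hospital", ("n", "l"))],
          head=[("HealthCareInstitution", ("n", "I"))],
          existential=frozenset({"I"}))
m2 = Rule(body=[("Doctor", ("n", "s")), ("Grant", ("a", "pi", "n")), ("Hospital", ("pi", "l"))],
          head=[("HealthCareInstitution", ("pi", "I")), ("Grant", ("a", "n", "I"))],
          existential=frozenset({"I"}))

Q = [("Hospital", ("$x", "'SanFrancisco'"))]
Q1 = [("Grant", ("$x", "$y", "$z"))]                  # Q' in Example 4
Q2 = [("HealthCareInstitution", ("$y", "$z"))]        # Q'' in Example 4

print(forward_relevant(Q, m1), forward_relevant(Q, m2))      # True True
print(backward_relevant(Q1, m1), backward_relevant(Q1, m2))  # False False
print(backward_relevant(Q2, m1), backward_relevant(Q2, m2))  # False False
```

Both head atoms of each rule contain the existential variable I, so no head atom is usable for backward relevance, which is why Q' and Q'' fail the backward test.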

Here and henceforth, we will use the term relevance to denote mapping relevance,
unless otherwise specified. It follows that, if a query Q is relevant to a mapping \(\mathcal{M}_{ij}\),
its translation \(Q^t\) is also relevant to that mapping.

Proposition 1 If a query Q formulated against \(S_i\) is relevant to a mapping \(\mathcal{M}_{ij}\), its
translation \(Q^t\) formulated against \(S_j\) is also relevant to \(\mathcal{M}_{ij}\), and vice versa.

The proof of the above proposition is straightforward and is omitted for space rea- 
sons. 

The above query Q entails a relevant rewriting, according to the next definition.

Definition 5 (Relevant Rewriting of a Query) Given a query Q relevant to a mapping
\(\mathcal{M}_{ij} = S_i \rightarrow S_j\), its translation \(Q^t\) is a relevant rewriting of Q against \(S_j\). We
say that \(\mathcal{M}_{ij}\) rewrites Q into \(Q^t\), denoted by \(Q \xrightarrow{\mathcal{M}_{ij}} Q^t\).

Based on the above definition, we can now define a rewriting sequence, as follows.

Definition 6 (Rewriting Sequence) For a query Q, if
\(Q \xrightarrow{\mathcal{M}_{01}} Q_1 \xrightarrow{\mathcal{M}_{12}} Q_2 \cdots \xrightarrow{\mathcal{M}_{(n-1)n}} Q_n\),
we say that Q rewrites into \(Q_n\). The mappings \(\mathcal{M}_{01}, \ldots, \mathcal{M}_{(n-1)n}\) are called
the rewriting sequence.

Figure 2: (a) Useless rewriting sequence, (b) alternative rewriting sequences and
relevant mappings (in bold).



An example of a rewriting sequence starting from the peer \(p_0\) to the peer \(p_7\) is
highlighted in bold in Figure 2 (b).

According to our problem definition, we need to find all the possible rewriting
sequences of a given input query Q on the initiating peer \(p_0\). However, a rewriting
sequence might not always exist between \(p_0\) and an arbitrary peer \(p_n\), since there might
be an intermediate mapping that does not entail a relevant rewriting of the query. We
denote such a mapping as a useless mapping and the entire sequence as a useless rewriting
sequence. An example of such a sequence, starting from \(p_0\), is depicted in Figure 2 (a),
where the mapping from peer \(p_1\) to \(p_2\) is not relevant. Avoiding useless sequences is
quite straightforward because they can be detected by adopting a local metric to assess
whether the target of the current peer is able to handle the query, before actually
shipping the query itself to that target. Such an evaluation can be done by using the
mapping rules themselves, as they are locally stored on the current peer and can be easily
inspected for that purpose.
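The local check described above amounts to pruning a traversal of the mapping graph. The following sketch makes strong simplifying assumptions that are not in the paper: the relevance of each mapping to the (translated) query is precomputed as a boolean flag on the edge, and the topology is a hypothetical one, not the network of Figure 2. The function name `rewriting_sequences` is likewise invented.

```python
from typing import Dict, List, Tuple

# peer -> list of (neighbour peer, is the connecting mapping relevant to the query?)
Network = Dict[str, List[Tuple[str, bool]]]


def rewriting_sequences(net: Network, start: str) -> List[List[str]]:
    """Maximal peer paths from `start` that traverse relevant mappings only."""
    sequences: List[List[str]] = []

    def visit(path: List[str]) -> None:
        extended = False
        for nxt, relevant in net.get(path[-1], []):
            # Local check performed before shipping the query to the target peer.
            if relevant and nxt not in path:
                extended = True
                visit(path + [nxt])
        if not extended and len(path) > 1:
            sequences.append(path)

    visit([start])
    return sequences


# Hypothetical topology: the mapping from p1 to p2 is useless for the query.
net: Network = {
    "p0": [("p1", True), ("p4", True)],
    "p1": [("p2", False)],
    "p2": [("p3", True)],
    "p4": [("p7", True)],
}

print(rewriting_sequences(net, "p0"))   # [['p0', 'p1'], ['p0', 'p4', 'p7']]
```

The query never reaches p2 or p3 in this toy network, because the useless mapping is detected at p1 before the query is shipped.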

Another issue that often occurs is that of alternative rewriting sequences, as
depicted in Figure 2 (b). Indeed, the current peer may have multiple alternative paths
to rewrite a given query, and may have to choose the most appropriate one. E.g. in
Figure 2 (b), \(p_0\) could choose among three possible alternatives \(p_1\), \(p_4\) and \(p_5\).
Exhaustively pursuing all possible rewritings is obviously not efficient, due to the great
number of destination peers and rewriting sequences. Moreover, only a few rewritings
along the sequences may happen to be the most relevant ones, and these are always
preferable to pursue. To this purpose, we need a relevance score for each possible rewriting
sequence (described next) in order to be able to rank the possible rewriting alternatives.
Consequently, it becomes feasible to rewrite the queries along the most relevant paths
(e.g. represented by the bold arrows in Figure 2 (b)).

Remark. We observe that one could apply Definition 4 in a straightforward manner to
address the previous problems. However, a relevance score solely based on a local
metric would not be sufficient as it would only check one mapping at a time. Conversely,
one needs to check an entire rewriting sequence among the possible alternatives. Thus,
a global metric is needed to assess the relevance of queries with respect to the mappings
in a rewriting sequence.



3.2 Relevance metric 

In this section, we present our novel relevance metric to quantify the degree of rele- 
vance of mappings in the network and the data structures that allow computing it. 

3.2.1 AF-IMF metric 

Our metric, which we call AF-IMF (atom frequency, inverse mapping frequency), is
an adaptation of the classical information retrieval metric TF-IDF to schema mapping.
Variations of the TF-IDF weighting scheme are often used by search engines as a cen- 
tral tool for scoring and ranking a document's relevance given a user query. Similarly, 
AF-IMF is a statistical measure to evaluate how important a query atom is to a map- 
ping in the entire collection. The importance increases proportionally to the number of 
times an atom appears in the mapping but is offset by the frequency of the atom in the 
collection. 

In the following, we first define the AF-IMF for an individual mapping rule, then 
we extend it to entire mappings. 

We introduce the atom count in the given mapping rule m, as the number of times
a given query atom \(a_q\) fully appears in m by using constant and variable unifications.
This count is usually normalized by the number of occurrences of all atoms in m. We
assume that each atom can only appear once in a mapping rule, thus implying that the
atom count in the numerator can be approximated to 1.

\[ AF_{ij} = \frac{n_{ij}}{\sum_k n_{kj}} \approx \frac{1}{\sum_k n_{kj}} \]

where \(n_{ij}\) is the number of occurrences of the considered atom \(a_i\) in \(m_j\), and the
denominator is the sum of the number of occurrences of all k atoms occurring in the body
(head, resp.) of \(m_j\), where \(a_i\) respectively appears. Note that having two separate AF
values for the body and the head, according to where the atom \(a_i\) appears in the mapping
rule \(m_j\), is crucial to distinguish forward from backward relevance, respectively.

The inverse mapping frequency is a measure of the general importance of the atom,
obtained by dividing the total number of mapping rules by the number of mapping
rules containing the atom in the body (head, resp.), and then taking the logarithm of
that quotient.

\[ IMF_i = \log \frac{|M|}{|\{m_j : a_i \in m_j\}|} \]

with \(|M| = \left|\bigcup_{i=1,\ldots,n} \mu_{st}\right|\) being the total number of mapping rules in the network,
which amounts to the union (without duplicates) of all the source-to-target tgds; and
\(|\{m_j : a_i \in m_j\}|\) being the number of mapping rules where the atom \(a_i\) appears (that
is, \(n_{ij} \neq 0\)) in the body (head, resp.). If the atom is not in the network, this will lead to
a division-by-zero, thus it is common to use \(1 + |\{m_j : a_i \in m_j\}|\) instead.

Notice that the computation of AF depends on both the current query atom \(a_i\) and
the current mapping rule \(m_j\). In contrast, the IMF computation does not depend on the
current mapping rule \(m_j\) but only on the current query atom \(a_i\).

Then,

\[ (AF\text{-}IMF)_{ij} = AF_{ij} \times IMF_i \approx \frac{1}{\sum_k n_{kj}} \times IMF_i \]

The above formula implies that the mapping rules with fewer atoms are preferred
over those with more atoms. Therefore, a high weight in AF-IMF is reached
by mapping rules with a low total number of atoms, and by atoms with a low frequency
in the global collection of mapping rules.
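As a worked illustration with hypothetical numbers (not taken from the paper), suppose the collection holds \(|M| = 8\) mapping rules, the atom \(a_i\) occurs in the body of 2 of them, and the rule \(m_j\) has 4 body atoms, one of which is \(a_i\). The text does not fix the base of the logarithm; base 2 is assumed here only to obtain round numbers.

```latex
\[
AF_{ij} \approx \frac{1}{4}, \qquad
IMF_i = \log_2 \frac{8}{2} = 2, \qquad
(AF\text{-}IMF)_{ij} \approx \frac{1}{4} \times 2 = 0.5
\]
```

Under the same assumptions, a rule containing \(a_i\) with only 2 body atoms would score \(\frac{1}{2} \times 2 = 1\), which shows why rules with fewer atoms rank higher.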

As already observed above when distinguishing forward from backward relevance,
a different value of the AF-IMF is computed, in a similar fashion, for atoms appearing
in the body (head, resp.) of the mapping rules.

A further step extends the above metric to all the query atoms \(a_q\) together,
so that it is possible to assign a comprehensive value of relevance to the entire
query Q with respect to the mapping rule \(m_j\) (as opposed to the previous case, when
only an individual query atom \(a_i\) was considered). Such a step requires a simple measure
(e.g. the sum) to combine the AF-IMF scores separately obtained by the query
atoms \(a_q\) of Q.

After combining the above scores, we obtain the overall score for
the mapping rule \(m_j\), as follows:

\[ (AF\text{-}IMF)_j = \sum_i (AF\text{-}IMF)_{ij} \]

After defining the notion of AF-IMF for an individual mapping rule m, we now
extend the definition to the entire mapping \(\mathcal{M}\). We recall that the final goal of our
metric is to assign a relevance value to those mappings that the current peer is about to
evaluate in order to realize the query translation of Q.

Given a mapping scenario \(\mathcal{M} = (S, T, \mu_{st})\) defined by means of a set of K (K > 0)
mapping rules in \(\mu_{st}\), we compute the overall AF-IMF score for \(\mathcal{M}\) as the sum of the
AF-IMF scores obtained by each mapping rule \(m \in \mu_{st}\) (according to the forward or
backward definition of relevance).

In other words, if the relevance is backward, i.e. the query Q matches the head side of
the mapping rule \(m_j\) (see Definition 3), the AF-IMF computation is done as shown
below:

\[ (AF\text{-}IMF)_{\mathcal{M},\,head} = \sum_{j=1}^{K_h} (AF\text{-}IMF)_j \]

where \(K_h \le K\) is the number of rules \(m_j \in \mu_{st}\) such that Q matches their head
side.

Instead, if the relevance is forward, i.e. the query Q matches the body side of the
mapping rule \(m_j\) (see Definition 2), the AF-IMF computation is done as shown below:

\[ (AF\text{-}IMF)_{\mathcal{M},\,body} = \sum_{j=1}^{K_b} (AF\text{-}IMF)_j \]

where \(K_b \le K\) is the number of rules \(m_j \in \mu_{st}\) such that Q matches their body
side.

[Figure 3, source schema: Hospital (name, location); Grant (amount, institution, manager);
Doctor (name, salary, affiliation); Department (name, website, address).
Target schema: HealthCareInst (name, id)]

Figure 3: A Schema Mapping Example



The overall relevance of the query Q with respect to the entire mapping \(\mathcal{M}\) is the
maximum value between the two formulas above:

\[ (AF\text{-}IMF)_{\mathcal{M}} = \max\left((AF\text{-}IMF)_{\mathcal{M},\,head},\ (AF\text{-}IMF)_{\mathcal{M},\,body}\right) \]

In such a way, given a query Q as input, the AF-IMF metric assigns a score of 
relevance to each inward and outward mapping of the peer, to let it choose the most 
relevant paths for query translation, i.e. the ones with the highest scores. 
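The aggregation described above can be sketched in a few lines of Python. This is only an illustration: the function names `rule_score` and `mapping_score` are hypothetical, the per-atom scores are invented values, and the sum is used as the combining measure, as suggested in the text.

```python
def rule_score(atom_scores):
    """AF-IMF score of one mapping rule: sum over the query atoms."""
    return sum(atom_scores)


def mapping_score(head_rule_scores, body_rule_scores):
    """AF-IMF score of a mapping: max of the head-side and body-side sums."""
    return max(sum(head_rule_scores), sum(body_rule_scores))


# Hypothetical per-atom scores (invented values, for illustration only).
m1 = rule_score([0.4, 0.1])   # rule whose body is matched by Q
m2 = rule_score([0.3])        # rule whose body is matched by Q
m3 = rule_score([0.9])        # rule whose head is matched by Q

overall = mapping_score(head_rule_scores=[m3], body_rule_scores=[m1, m2])
print(overall)
```

Here the head-side sum exceeds the body-side sum, so the head-side value becomes the overall relevance of the mapping.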

Example 5 Consider Figure 3, which is a slightly different version of Figure 1. A set of correspondences $v_1$, $v_2$, $v_3$, $v_4$ and $v_5$ connects elements in the two schemas. Assume that the set of corresponding s-t tgds is the one reported below:

Source-to-Target Tgds

$m_1$: $\forall n, l:\ Hospital(n, l) \rightarrow \exists I:\ HealthCareInstitution(n, I)$
$m_2$: $\forall n, s, d, dw, da, a, pi, l:\ Doctor(n, s, d) \wedge Grant(a, pi, n) \wedge Department(d, dw, da) \wedge Hospital(pi, l)$
       $\rightarrow \exists I, g:\ HealthCareInstitution(pi, I) \wedge Grant(a, n, l, d) \wedge Dept(n, g, dw)$
$m_3$: $\forall n, w, a:\ Department(n, w, a) \rightarrow \exists g:\ Dept(n, g, w)$

The mapping $\mathcal{M}$ between source $S$ and target $T$ includes all three mapping rules above.

Now, let us imagine that the source peer $S$ is connected to other target peers ($T_1$, $T_2$ and $T_3$), all having, for simplicity, an identical target schema $T$ but different sets of mappings. Such mappings ($\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$) are simply variants of $\mathcal{M}$, i.e. mappings derived from $\mathcal{M}$ by including a different subset of the mapping rules of $\mu_{st}$, as specified in the following:

• $\mathcal{M}_1$: $\mu_{st} = \{m_1, m_2\}$

• $\mathcal{M}_2$: $\mu_{st} = \{m_1, m_3\}$

• $\mathcal{M}_3$: $\mu_{st} = \{m_2, m_3\}$

If we compute the AF-IMF scores for all the mappings above, i.e. $\mathcal{M}$, $\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$, it is easy to check that $\mathcal{M}$ will always get the highest score, since it is the most complete mapping. Therefore the peer $T$, which is connected to $S$ through $\mathcal{M}$, represents the most relevant peer to follow in the query reformulation process.

Nevertheless, a further complication arises since IMF cannot be exactly computed 
as the size of the entire collection of mapping rules at a given time is not known, due 
to the fact that the network is dynamically changing. 

To address this problem, each peer is equipped with a set of semantic data struc- 
tures, that summarizes the local and external mappings of a peer (see next Section for 
details). Thus, by exploiting such data structures, we can compute an approximation 
of IMF for the distributed case, as discussed in Section 3.2.3. 

3.2.2 Semantic Data Structures 

In this section, we first introduce the local semantic data structures stored on each peer. 
Then, in Section 3.2.3, we present how they can be exploited to approximate the IMF 
values. 

Figure 4 represents the local data structures on each peer. Each peer maintains a set of local or internal mapping rules (see footnote 4), i.e. mapping rules from its local schema to the schema of each of its acquaintances, the latter being a selected subset of the peer's neighbors [17, 14]. Moreover, it also stores a Local Semantic View (LSV in short), which holds information about external mapping rules (distinct from the local ones), selected uniformly at random from the network. This view is used to compute the relevance values. Precisely, the LSV of each peer consists of a five-column table Mapping-content (Atom, Mapping, SrcPeer, TgtPeer, Peer) and a two-column table View (Peer, Age), with a foreign key constraint between View.Peer and Mapping-content.Peer. The Mapping-content relation has a column Atom containing an atom of a mapping rule in the network; a column Mapping containing the ID of the mapping rule in which Atom appears; a column SrcPeer containing the ID of the external source peer for which Mapping is an outward mapping; a column TgtPeer containing the ID of the external target peer for which Mapping is an inward mapping; and a column Peer containing the ID of the peer in the network that provided the current tuple in a gossip cycle. The View relation has a column Peer containing the ID of a peer in the network, and a column Age containing a numeric field that denotes the age of the mapping rules since they were included in the View. Figure 4 shows an example of an LSV on a peer.
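As a rough illustration of the two LSV tables, the following Python sketch models Mapping-content rows and the View table in memory. The class and method names are assumptions made for this example, not part of the described prototype; the sample row reuses the values that appear in Figure 4 (atom Grant, mapping SHA-1(m2), peers p3, p4, p6).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MappingContentRow:
    atom: str      # atom appearing in a mapping rule
    mapping: str   # ID of the mapping rule that contains the atom
    src_peer: str  # external source peer of the mapping
    tgt_peer: str  # external target peer of the mapping
    peer: str      # peer that supplied this tuple during a gossip cycle


@dataclass
class LocalSemanticView:
    mapping_content: set = field(default_factory=set)
    view: dict = field(default_factory=dict)  # peer ID -> age

    def add(self, row):
        # Keep the foreign key valid: the supplying peer must appear in View.
        self.view.setdefault(row.peer, 0)
        self.mapping_content.add(row)

    def distinct_rules(self):
        return {r.mapping for r in self.mapping_content}

    def rules_containing(self, atom):
        return {r.mapping for r in self.mapping_content if r.atom == atom}


lsv = LocalSemanticView()
lsv.add(MappingContentRow("Grant", "SHA-1(m2)", "p3", "p4", "p6"))
lsv.add(MappingContentRow("Doctor", "SHA-1(m2)", "p3", "p4", "p6"))
print(len(lsv.distinct_rules()), lsv.rules_containing("Grant"))
```

Storing one row per atom keeps lookups by atom simple, at the cost of repeating the mapping ID across rows of the same rule.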

To uniquely identify each mapping rule in the PDMS, we assign an ID to each mapping, using cryptographic hash functions (e.g. SHA-1) to reduce the probability of collision (see footnote 5).
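A minimal sketch of such an identifier, using Python's standard `hashlib`, is given below. The function name `mapping_rule_id`, the textual serialization of a rule and the whitespace normalization are all assumptions made for illustration; the text only states that a cryptographic hash such as SHA-1 is used.

```python
import hashlib


def mapping_rule_id(rule_text: str) -> str:
    """Return a SHA-1 hex digest used as the ID of a mapping rule."""
    # Assumption: collapse whitespace so that formatting differences
    # do not yield different IDs for the same rule.
    canonical = " ".join(rule_text.split())
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


m3 = "Department(n, w, a) -> Dept(n, g, w)"
print(mapping_rule_id(m3))
```

Two peers that serialize a rule in the same way obtain the same ID, which is what allows duplicate rules to be recognized across the network.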

Footnote 4: The mapping statements have been omitted from Figure 4 to avoid clutter.

Footnote 5: In DHTs or structured P2P networks, on which PDMS are based, a unique key identifier is assigned to each peer and object. IDs associated with objects are mapped through the DHT protocol to the peer responsible for that object. In our setting, each object is a mapping.

Figure 4: Local data structures on a peer (the Local Semantic View, with its Mapping-content table, and the Mapping Summary).



Since the size of the LSV is limited, view entries need to be replaced based on their age information. In order to maintain each LSV on the peers, we adopt classical thread-based gossiping mechanisms, which update the LSV with newly incoming tuples from the outside. In Section 4.4 we provide the details of such maintenance.

Besides local mappings, each peer also maintains an additional descriptive data structure of such mappings, called Mapping Summary, which is implemented as a local Bloom filter [5]. A Bloom filter is a method for representing a set $A = \{a_1, a_2, \ldots, a_n\}$ of $n$ elements (also called keys) to check the membership of any element in $A$. In the Mapping Summary, a bit vector $v$ of $m$ bits, initially all set to 0, is addressed by $k$ independent hash functions $h_1, h_2, \ldots, h_k$, each with range $\{1, \ldots, m\}$. In the Mapping Summary, keys are built as follows: for each mapping rule, the conjunction of all atoms in the body $\phi$ (in the head $\psi$, resp.) is a key; each individual atom $a_j$ in the body $\phi$ (in the head $\psi$, resp.) is a key; each subset $a_i, \ldots, a_n$ of the atoms in the body $\phi$ (in the head $\psi$, resp.), such that there exists at least one joined variable between each pair of atoms $a_j$ and $a_{j+1}$, is a key. By enumerating the above keys, each body (head, resp.) of a mapping rule with $n$ atoms has a total of $\frac{n(n+1)}{2}$ entries in a Mapping Summary. Such a combination of atoms and/or individual atoms may appear in several distinct local mapping rules on that peer. For each atom (or combination of atoms thereof) $a \in A$, the bits at positions $h_1(a), h_2(a), \ldots, h_k(a)$ in $v$ are set to 1. A membership query checks the bits at the positions $h_1(a), h_2(a), \ldots, h_k(a)$. If any of them is 0, the atom $a$ is not in the set. Otherwise, we conjecture that $a$ is in the set, although this may be a false positive. The aim is to tune $k$ and $m$ so as to have an acceptable probability of false positives. The advantage of using Bloom filters is that they require very little storage, at the slight risk of false positives. This probability is already quite small with a total of 4 different hash functions [13]. Figure 4 shows an example of a Mapping Summary on a peer.
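The key construction can be illustrated with a short Python sketch. It rests on one reading of the rule above, stated here as an assumption: keys are the contiguous runs of atoms in which each pair of consecutive atoms shares a variable, plus the full conjunction. Under this reading, a side with $n$ atoms yields at most $n(n+1)/2$ keys. The function name `summary_keys` and the tuple encoding of atoms are hypothetical.

```python
def summary_keys(atoms):
    """atoms: list of (relation_name, tuple_of_variables) for one rule side."""
    n = len(atoms)
    keys = []
    for start in range(n):
        for end in range(start, n):
            run = atoms[start:end + 1]
            # Consecutive atoms must share at least one variable.
            joined = all(set(a[1]) & set(b[1]) for a, b in zip(run, run[1:]))
            # The conjunction of all atoms is always a key.
            if joined or len(run) == n:
                keys.append(tuple(name for name, _ in run))
    return keys


fully_joined = [("Doctor", ("n", "s", "d")), ("Department", ("d", "dw", "da"))]
body_m2 = [
    ("Doctor", ("n", "s", "d")),
    ("Grant", ("a", "pi", "n")),
    ("Department", ("d", "dw", "da")),
    ("Hospital", ("pi", "l")),
]
print(summary_keys(fully_joined))
print(len(summary_keys(body_m2)))
```

With this atom order, the body of $m_2$ produces fewer keys than the upper bound, because some consecutive atoms do not share a variable.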

Similarly to the LSV, the Mapping Summary needs to be maintained in the presence of changes to the atoms within the mapping rules, and/or additions and deletions of the mapping rules themselves. This is done by maintaining, for each location $l$ in the bit vector $v$, a count $c(l)$ of the number of times that the bit has been set to 1. The counts are initially all set to 0. When insertions or deletions take place, the counts are incremented or decremented accordingly.
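A counting Bloom filter of this kind can be sketched as follows. The class name, the default sizes and the way the $k$ hash functions are derived (salting SHA-1 with the function index) are illustrative assumptions; the text does not prescribe specific hash functions.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m=64, k=4):
        self.m, self.k = m, k
        self.counts = [0] * m  # c(l) for every location l

    def _positions(self, key):
        # Derive k positions by salting SHA-1 with the function index.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.counts[pos] += 1

    def remove(self, key):
        if key not in self:
            raise KeyError(key)
        for pos in self._positions(key):
            self.counts[pos] -= 1

    def __contains__(self, key):
        # A zero count at any position proves the key is absent.
        return all(self.counts[pos] > 0 for pos in self._positions(key))


summary = CountingBloomFilter()
summary.add("Hospital")
summary.add("Grant")
summary.remove("Hospital")
print("Grant" in summary)
```

A positive answer may still be a false positive, so callers should treat membership as a hint rather than a guarantee.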

Finally, to allow friendship linking among peers, a peer maintains a third structure, which is basically a local FOAF file containing the URIs of its friends' FOAF files. Whenever a user (or a peer) generates its FOAF file, it can obtain an identity for that file in the form of a URI. This URI could point to a reference in a friend's FOAF file. URIs correspond to unique peer and object identifiers in a PDMS. In particular, a peer $p_i$ may need to store in its FOAF file: (1) the list of other peers it knows and is friends with, as a link to each friend's FOAF file (e.g. P3.rdf in the example); (2) possibly, the link to each friend's Mapping Summary (e.g. P3_MapSum in the example below).

<foaf:knows>
  <foaf:Peer>
    <foaf:peerID>P3</foaf:peerID>
    <rdfs:seeAlso rdf:resource="http://www.mirospthree.com/P3.rdf"/>
    <rdfs:seeAlso rdf:resource="http://www.mirospthree.com/P3_MapSum"/>
  </foaf:Peer>
</foaf:knows>

The main goal of FOAF files is to maintain the current friendship links of a given 
peer. During query translation, the FOAF file is expanded by adding new friends, by 
invoking the Algorithm FindDirectFOAFFriends, described in Section 4. Notice that 
adopting and exploiting the friendship links of a given peer during the query transla- 
tion process is complementary to exploiting the semantic mappings towards the peer's 
acquaintances. In fact, the friendship links are especially useful in the presence of net- 
work churn, as they act as a background network regardless of the peer's acquaintances 
and its direct inward/outward mappings. A more detailed experiment about network 
churn, scalability and the usefulness of FOAF links is provided in Section 5. 
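The FOAF fragment shown earlier in this section can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; wrapping the fragment in an `rdf:RDF` root with namespace declarations is an assumption made so that the example parses, and `foaf:Peer` and `foaf:peerID` are the element names used in this paper's example rather than terms checked against the FOAF vocabulary.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

FOAF_DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:knows>
    <foaf:Peer>
      <foaf:peerID>P3</foaf:peerID>
      <rdfs:seeAlso rdf:resource="http://www.mirospthree.com/P3.rdf"/>
      <rdfs:seeAlso rdf:resource="http://www.mirospthree.com/P3_MapSum"/>
    </foaf:Peer>
  </foaf:knows>
</rdf:RDF>"""


def friends(doc):
    """Map each friend's peer ID to the list of URIs it links to."""
    root = ET.fromstring(doc)
    result = {}
    for peer in root.findall(".//foaf:Peer", NS):
        peer_id = peer.findtext("foaf:peerID", namespaces=NS).strip()
        resource_attr = "{%s}resource" % NS["rdf"]
        links = [e.get(resource_attr) for e in peer.findall("rdfs:seeAlso", NS)]
        result[peer_id] = links
    return result


print(friends(FOAF_DOC))
```

The second link of each friend is the one that would be followed to reach that friend's Mapping Summary.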

In our model, no peer can access another peer's Mapping Summary until an explicit friendship link has been established between the two peers, which leads them to modify their respective FOAF files accordingly. This mechanism gracefully replaces an explicit negotiation and coordination among peers for accessing their respective data structures. An additional access control mechanism, e.g. [7], can be adopted on top of FOAF files to further strengthen the security of the network.

In the remainder of this discussion and in Section 4, we denote the peers indexed 
in a FOAF file as 'friends'. These represent the peers whose mapping summary can be 
accessed, in order to widen the scope of the queries. In particular, in Section 4, we will 
discuss how to enlarge the set of simple friends of a peer by exploiting friendship links 
in its FOAF file. 

3.2.3 Distributed computation of AF-IMF 

Using the local semantic view and the local mapping rules, we can compute $IMF_i$ distributively, as follows. Let $k$ be the number of distinct local mapping rule entries and let $t$ be the number of distinct mapping rules in the LSV. We know by definition that the $k$ entries and the $t$ entries do not overlap, thus we may say that locally we have $k + t$ mapping rules. Then, we have to determine an approximation of $|M|$, the total number of mapping rules in the collection, possibly without duplicates. We may think of computing $N$, the total number of peers in the network, and multiplying it by $k + t$, thus obtaining $|M| = (k + t) \times N$. Moreover, we observe that $N$ can be easily computed if we know the network topology. For instance, for DHTs it suffices to record the size of the routing table, which is $r = \log_2(N)$, and to take the inverse, $N = 2^r$. For super-peer networks, we may have an entry point that registers the total number of peers $N$. For unstructured P2P networks, we may rely on flooding to count the total number of peers in the network. In a similar way, $|\{m_j : a_i \in m_j\}|$ can be computed by selecting, among the $k$ and $t$ local mappings, those that contain the atom $a_i$, thus obtaining $|\{m_j : a_i \in m_j\}| = (k_i + t_i) \times N$.
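The estimate can be sketched numerically as follows. The function names are hypothetical, and the smoothed form $\log(n / (1 + n_i))$ is borrowed from Algorithm 1 as an assumption about how the two estimates are combined into an IMF value.

```python
import math


def estimate_network_size(routing_table_size):
    """For a DHT whose routing table has r = log2(N) entries, N = 2**r."""
    return 2 ** routing_table_size


def approximate_imf(k, t, k_i, t_i, n_peers):
    """Approximate IMF of atom a_i from local counts scaled by N."""
    total = (k + t) * n_peers            # estimate of |M|
    containing = (k_i + t_i) * n_peers   # estimate of |{m_j : a_i in m_j}|
    return math.log(total / (1 + containing))


n_peers = estimate_network_size(4)
print(n_peers, approximate_imf(k=3, t=5, k_i=1, t_i=1, n_peers=n_peers))
```

Since both estimates are scaled by the same $N$, the result depends mostly on the local ratio between $k + t$ and $k_i + t_i$.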

However, we need to avoid duplicate mapping rules in the previous computation. 
In order to do this, we need to uniquely identify a mapping in the entire network. A 
simple and effective way to do this is to couple each mapping with its signature, using 
a cryptographic hash function (e.g. SHA-1). We present in Section 4 an algorithm to 
compute AF-IMF distributively, that avoids duplicate mappings by using signatures.

Remark. As a final observation, we underline that the problems illustrated in Figure 2
are both overcome, since the useless sequences do not affect the AF-IMF metric. More- 
over, AF-IMF enables the search of the most relevant rewriting sequences in a global 
fashion, as expected by our previous reasoning. In the experimental analysis (Sec- 
tion 5), we show the effectiveness of this metric, also when compared to a local metric 
(e.g. by adopting the sole AF as a local metric). 

4 Reformulation Algorithms 

In this section, we illustrate our query reformulation algorithms: the core algorithm that translates a query based on relevance; an algorithm for seeking new friends that hold relevant mappings for the query; and a distributed algorithm to compute the relevance of mappings, which is used by the two former algorithms. Finally, we briefly discuss the gossiping algorithm for updating semantic views.

4.1 Distributed computation of the relevance 

Algorithm 1 computes a measure of the relevance of a set of mapping rules on a given peer with respect to an input query, with the aim of getting an ordered top-k list of mappings to be exploited (by Algorithm 2) and of finding new friends (by Algorithm 3).

The algorithm has two main parts. Lines 1-14 compute the IMF values for each query atom, and this entails a separate computation, depending on which side of the mapping rule the query atom belongs to. Therefore, two vectors BodyIMF and HeadIMF are built to store the IMF values of each atom in the query $Q$.

Then, the second part of the algorithm (lines 15-32) computes the AF values by counting the number of times that a query atom occurs in the matched side of the mapping rule, and the complete value of AF-IMF is then returned. The final relevance value (line 32) for the whole mapping rule is obtained by applying a suitable ranking function to the values in the above vectors (e.g. sum).

Figure 5: An example of ComputeRelevance (Algorithm 1).

Let us observe that the computation of the IMF only depends on the atom $a_i$ in the query $Q$, and not on the current mapping rule. For this reason, we also make sure that the computation at lines 1-14 is done only once for the same query, by saving intermediate results.

Indeed, the computation is done by asking each known peer (both destination peers reached through mappings and newly discovered peer friends in the FOAF file $f$). The maximum number of inquiries is given by the REQS threshold. Observe that if REQS = 0, no external inquiries are made and only the entries of the current peer's LSV are inspected, whereas a value of REQS greater than 0 leads to also inspecting the LSVs of external peers. Also note that such inquiries discard duplicates through the asynchronous method GetDistinctMappingRules, which checks the signatures of the mapping rules. We omit the pseudo-code of this method for space reasons.

Figure 5 shows an example of how Algorithm 1 computes the relevance. A query $Q$ is initially posed against the peer $p_0$, which in turn chooses among three alternative target peers (also called acquaintances). Also, note that from $p_0$ toward $p_j$ there is no direct mapping, but rather a FOAF link depicted by a dotted blue arrow. Thus, mappings $\mathcal{M}_{01}$ (from $p_0$ to $p_1$), $\mathcal{M}_{40}$ (from $p_4$ to $p_0$) and $\mathcal{M}_{05}$ (from $p_0$ to $p_5$) must be evaluated in order to find the top-k relevant ones for the input query (in this example, we assume for simplicity that k = 1). By inspecting $p_0$'s LSV, Algorithm 1 computes the relevance metric for each mapping rule $m$ of each mapping involved ($\mathcal{M}_{01}$, $\mathcal{M}_{40}$, $\mathcal{M}_{05}$), by assigning an AF-IMF value to each involved atom, as previously discussed. At the end, the mapping $\mathcal{M}_{01}$ (from $p_0$ to $p_1$) gets the highest relevance score among all the mappings, thus becoming the top-1 step in the rewriting sequence of query $Q$.






Algorithm 1: ComputeRelevance - computes the relevance according to the gossiped information in the local semantic view

Input : A query Q as set of atoms A_Q, a list of k mapping rules m_k, a peer p with its LSV and FOAF file f
Output: The vector of relevance values RV for the input list of k mapping rules

 1  foreach atom a_i in A_Q do
        // Compute the IMF value according to the matchedSide
 2      n = total nr. of mapping rules in the LSV;
 3      nB_i = total nr. of mapping rules in the LSV containing a_i in the body;
 4      nH_i = total nr. of mapping rules in the LSV containing a_i in the head;
 5      Let count_reqs = 0;
 6      foreach p' in the View of LSV and in the FOAF file f do
 7          if count_reqs >= REQS then
 8              break;
 9          n += GetDistinctMappingRules(p');
10          nB_i += GetDistinctMappingRules(p', "Body", a_i);
11          nH_i += GetDistinctMappingRules(p', "Head", a_i);
12          count_reqs++;
13      BodyIMF[i] = log(n / (1 + nB_i));
14      HeadIMF[i] = log(n / (1 + nH_i));
15  foreach mapping rule m_k in the list of input mapping rules do
16      if all atoms a_i in A_Q are in the body of m_k then
17          matchedSide = "Body";
18      else
19          if all atoms a_i in A_Q are in the head of m_k then
20              matchedSide = "Head";
21          else
                // No relevance
22              RV[k] = 0;
23              continue;
24      foreach atom a_i in A_Q do
25          AF-IMF[i] = 0;
            // Compute the AF-IMF value for a_i according to the matchedSide
26          if matchedSide == "Body" then
27              BodyAF_i = count of the nr. of a_i in the body of m_k;
28              AF-IMF[i] = BodyAF_i * BodyIMF[i];
29          else
30              HeadAF_i = count of the nr. of a_i in the head of m_k;
31              AF-IMF[i] = HeadAF_i * HeadIMF[i];
        // Compute the final relevance value RV for the whole mapping rule m_k
32      RV[k] = RankFn(AF-IMF[i]);
33  return RV;
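The following Python sketch mirrors the two parts of Algorithm 1 under simplifying assumptions. Rules are represented as pairs of atom-name lists, remote LSVs are passed in as dictionaries keyed by rule ID (so that duplicates are discarded by ID rather than through the asynchronous GetDistinctMappingRules call), and the sum is used as ranking function. All names and the data layout are hypothetical.

```python
import math


def compute_relevance(query_atoms, rules, lsv_rules, remote_views=(), reqs=0, rank_fn=sum):
    """
    query_atoms: list of atom names in the query.
    rules: list of (body_atoms, head_atoms) pairs to score; atoms are lists.
    lsv_rules: dict rule_id -> (body_atoms, head_atoms) from the local LSV.
    remote_views: iterable of dicts with the same shape, one per known peer.
    reqs: maximum number of remote views to consult (the REQS threshold).
    """
    # Part 1 (lines 1-14): IMF per query atom over the distinct known rules.
    known = dict(lsv_rules)
    for count, view in enumerate(remote_views):
        if count >= reqs:
            break
        for rule_id, rule in view.items():
            known.setdefault(rule_id, rule)  # same ID means duplicate rule
    n = len(known)
    if n == 0:
        return [0.0 for _ in rules]
    body_imf, head_imf = {}, {}
    for atom in query_atoms:
        n_body = sum(1 for body, _ in known.values() if atom in body)
        n_head = sum(1 for _, head in known.values() if atom in head)
        body_imf[atom] = math.log(n / (1 + n_body))
        head_imf[atom] = math.log(n / (1 + n_head))

    # Part 2 (lines 15-32): AF on the matched side, combined with the IMF.
    relevance = []
    for body, head in rules:
        if all(a in body for a in query_atoms):
            side, imf = body, body_imf
        elif all(a in head for a in query_atoms):
            side, imf = head, head_imf
        else:
            relevance.append(0.0)  # no relevance
            continue
        relevance.append(rank_fn(side.count(a) * imf[a] for a in query_atoms))
    return relevance


lsv = {
    "m1": (["Hospital"], ["HealthCareInstitution"]),
    "m2": (["Doctor", "Grant", "Department", "Hospital"],
           ["HealthCareInstitution", "Grant", "Dept"]),
    "m3": (["Department"], ["Dept"]),
}
rv = compute_relevance(["Doctor"], list(lsv.values()), lsv)
print(rv)
```

With REQS = 0 only the local LSV is consulted; passing remote views together with a positive `reqs` enlarges the rule collection on which the IMF is computed.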



4.2 Translating queries based on relevance 

Algorithm 2 translates a query initiated at a peer, first against its set of local mappings and then by exploiting local friendship links at that peer. The algorithm is inherently recursive, and at each iteration increases the number of query hops, until a given threshold $\alpha$ is reached. This avoids exploring the entire network, by conveying the query toward a limited number of peers. By exploiting the notion of relevance for the input query $Q$, new friends are discovered and added to the FOAF friend list.

By invoking the method FindDirectFOAFFriends (line 4), the current peer enlarges the list of its friends in its local FOAF file. Therefore, new relevant friends might be discovered, similarly to real-life friendship mechanisms, and to friend-based game applications (e.g. Farmville) in modern social platforms (e.g. Facebook). Lines 6-8 invoke the method ComputeRelevance for each local mapping $\mathcal{M}_i$ of the peer, in order to get the relevance scores for such local mappings. Then, at line 9, the list of local mappings is ordered according to the calculated relevance scores with respect to the input query $Q$.

Figure 6: An example of Query Translate (Algorithm 2).

Mapping identity is checked in lines 11-12, in order to avoid using the same mappings more than once in different iterations. The query rewriting proceeds by taking into account the direction of the mapping (cf. Definition 1) and can take place along (line 15) or against (line 24) the mapping, thus obtaining the translated query $Q'$. Then, according to the type of the mapping considered (inward or outward), the input query $Q$ and the translated one $Q'$ are either executed against the current peer or used in the recursive call of the algorithm. Next, the query translation task is pushed towards the new interesting peer friends encoded in the FOAF file $f$ (lines 31-34). This search exploits the peer friends' Mapping Summary to check whether there is a high number of mappings that contain atoms of the input query $Q$ (via the method ComputeFriendsWithGreatestCount). Finally, all the query results res are returned (line 35) as the union of all the results harvested throughout the recursive invocations of the algorithm.

Figure 6 shows an example of execution of Algorithm 2. A query $Q$ is posed against the peer $p_0$. In trying to choose the most relevant rewriting sequence (lines 5-30), $p_0$ applies Algorithm 1 (for simplicity, we assume that top-k = 1). This way, the query $Q$ is rewritten and translated transitively until $p_6$ is reached. No further translation is possible, since $p_6$ is a terminal node. However, FOAF links found in line 4 of Algorithm 2 are also exploited in this example. Indeed, they allow the traversal of disconnected subsets of the nodes in the graph of Figure 6 (lines 31-34). If the friend reachable through the link is able to treat the query, the query can be further propagated to that friend and its subgraph. In the figure, one can see that $p_7$ and $p_{11}$ receive the original query $Q$ and one of its rewritings, respectively, from $p_0$ and $p_3$. By contrast, $p_{14}$ is not able to treat the query that $p_9$ holds, thus such a query is not propagated further. Obviously, each friend would further spread the query, thus increasing the total number of relevant rewritings.

Algorithm 2: TranslateQuery - Query translation based on relevance

Input : Query Q as set of atoms A_Q and a peer p with its list MList of local mappings Σ_{1≤i≤n} M_i and its FOAF file f
Output: Query results res of the query Q against the peer p exploiting both the set of local relevant mappings Σ_{1≤i≤n} M_i and new relevant peer friends

 1  if Q.query-hops > α then
 2      return res;
 3  increase Q.query-hops by 1;
 4  Call FindDirectFOAFFriends(Q, p);
 5  Let L be a list of mappings ordered by relevance;
 6  foreach local mapping M_i in MList do
 7      RV = Call ComputeRelevance(Q, M_i.MappingRules());
 8      mapscore[i] = SumvaluesFromVector(RV);
 9  L = Order MList according to mapping relevance values in mapscore;
10  foreach top-k ordered mapping M_i in L do
11      if M_i has been already processed then
12          continue;
            // To avoid cycles
13      Let destPeer be the destination peer through mapping M_i;
14      if Q is relevant to the body of M_i then
15          Translate Q along M_i obtaining Q';
16          if M_i is outward then
17              res = res ∪ Eval(Q);
18              res = res ∪ TranslateQuery(Q', destPeer);
19          else
20              res = res ∪ Eval(Q');
21              res = res ∪ TranslateQuery(Q, destPeer);
22      else
23          if Q is relevant to the head of M_i then
24              Translate Q against M_i obtaining Q';
25              if M_i is outward then
26                  res = res ∪ Eval(Q');
27                  res = res ∪ TranslateQuery(Q, destPeer);
28              else
29                  res = res ∪ Eval(Q);
30                  res = res ∪ TranslateQuery(Q', destPeer);
31  Let F be a list of friends;
32  F = Call ComputeFriendsWithGreatestCount(Q, f);
33  foreach top-k ordered friend pFoaf in F do
        // To exploit new interesting peer friends
34      res = res ∪ TranslateQuery(Q, pFoaf);
35  return res;

The following proposition holds.



Proposition 2 If $|A_Q|$ is the size (number of atoms) of an input query $Q$ and $|M_r|$ the number of the relevant mappings in the PDMS, then the number of rewritings generated by TranslateQuery is $O(|M_r|^{|A_Q|})$.

4.3 Seeking new friends 

Algorithm 3 updates the FOAF file of a given peer by adding new friends, discovered after an exhaustive inspection of the content of the local semantic view of the peer. Before adding a peer $p'$ to the FOAF file of the current peer, a formal invitation is sent and must be accepted. A simple extension of Algorithm 3 can be devised, in which an external peer that is not a friend of a friend is added to the FOAF file. We omit its pseudocode for the sake of conciseness.

Algorithm 3: FindDirectFOAFFriends - Finds the top-k relevant "Simple Friends" and adds their entries in the FOAF file

Input : A query Q as set of atoms A_Q and a peer p with its list LSVList of mappings Σ_{1≤i≤n} M_i in the peer's local semantic view (LSV) and a FOAF file f
Output: The updated FOAF file f

 1  Let L be a list of mappings ordered by relevance;
 2  foreach mapping M_i in LSVList do
 3      RV = Call ComputeRelevance(Q, M_i.MappingRules());
 4      mapscore[i] = SumvaluesFromVector(RV);
 5  L = Order LSVList according to mapping relevance values in mapscore;
 6  foreach top-k mapping M_i in the ordered list L do
 7      Let p' be the target peer through M_i;
 8      if p' is not in the FOAF file f then
 9          Call InvitePeer(p, p');
            // Asynchronous method
10          if the previous invitation has been accepted then
11              Insert p' in the FOAF file f;
12  return the updated FOAF file f;
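A compact Python rendering of Algorithm 3 is sketched below, under simplifying assumptions: the FOAF file is modelled as a set of peer IDs, the relevance of a mapping is supplied by a caller-provided function, and the invitation is a synchronous callback rather than the asynchronous method of the pseudocode. All names are hypothetical.

```python
def find_direct_foaf_friends(lsv_mappings, score_mapping, foaf, invite, top_k=1):
    """
    lsv_mappings: list of (mapping_id, target_peer) pairs taken from the LSV.
    score_mapping: callable mapping_id -> relevance score for the query.
    foaf: set of peer IDs already listed as friends (updated in place).
    invite: callable target_peer -> bool, True when the invitation is accepted.
    """
    # Lines 1-5: rank the LSV mappings by relevance, highest first.
    ranked = sorted(lsv_mappings, key=lambda m: score_mapping(m[0]), reverse=True)
    # Lines 6-11: invite the target peers of the top-k mappings.
    for _, target in ranked[:top_k]:
        if target not in foaf and invite(target):
            foaf.add(target)
    return foaf


scores = {"M1": 0.2, "M2": 0.9, "M3": 0.5}  # invented relevance values
mappings = [("M1", "p1"), ("M2", "p4"), ("M3", "p5")]
foaf = find_direct_foaf_friends(
    mappings, scores.get, foaf={"p3"}, invite=lambda peer: peer != "p5", top_k=2
)
print(sorted(foaf))
```

In this run the two most relevant mappings lead to p4 and p5, and only p4 accepts the invitation, so only p4 is added.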



4.4 Gossiping mapping entries 

To conclude this section, we discuss the gossip behavior of each peer. An active thread describes how a peer $p$ initiates a periodic gossip exchange, while the passive thread takes care of a gossip exchange initiated by some other peer $p''$.

The active behavior is triggered after each time interval TGossip. After incrementing the age of its view entries by 1, the peer $p$ selects from its view: (a) a peer $p'$, being the oldest contact, via select_oldest(), and (b) a viewSubset, being a random subset of Mapping-content within the local semantic view, of size LGossip (where 0 < LGossip <= VGossip). Then, peer $p$ sends to $p'$ a gossip message that contains the viewSubset. Recall that each peer keeps in its LSV a set of the mappings containing a specific atom (see Figure 4).

The peer $p$ receives in exchange a message gossipMsg' containing similar information from $p'$, and creates a viewEntry related to $p'$, with age 0. Next, peer $p$ discards duplicate view entries by using Merge. This takes care of the problem of redundant rewriting sequences.

The passive behavior is triggered when peer $p$ receives a gossip message containing Mapping-content and view entries from some peer $p''$. Peer $p$ answers by sending back a gossip message with its own Mapping-content and view information, updates its local view via merge() and select_recent(), and finally updates the local Mapping-content with respect to the new view, as described previously.

We omit the pseudocode of the Gossiping mapping entries Algorithm for lack of 
space. 
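As a rough sketch of the active behavior, the Python code below ages the view, picks the oldest contact, samples a subset of Mapping-content, and merges the reply while discarding duplicates. It is an illustration under assumptions: message passing is omitted, the view size limit (VGossip) and select_recent() are not modelled, and the function names are hypothetical.

```python
import random


def active_gossip_step(view, mapping_content, l_gossip, rng=random):
    """
    view: non-empty dict peer_id -> age.
    mapping_content: list of tuples stored in the LSV.
    Returns the chosen partner and the subset of tuples to send.
    """
    for peer in view:
        view[peer] += 1                       # age every view entry by 1
    partner = max(view, key=view.get)         # oldest contact
    size = min(l_gossip, len(mapping_content))
    view_subset = rng.sample(mapping_content, size)
    return partner, view_subset


def merge(mapping_content, received, view, partner):
    """Add the received tuples, dropping duplicates, and reset the partner's age."""
    view[partner] = 0
    seen = set(mapping_content)
    for row in received:
        if row not in seen:
            mapping_content.append(row)
            seen.add(row)
    return mapping_content


view = {"p3": 2, "p4": 5}
content = [("Grant", "id-m2"), ("Doctor", "id-m2"), ("Hospital", "id-m1")]
partner, subset = active_gossip_step(view, content, l_gossip=2)
merge(content, [("Grant", "id-m2"), ("Dept", "id-m3")], view, partner)
print(partner, len(content))
```

Picking the oldest contact spreads exchanges over the whole view, since a peer that has just been contacted is reset to age 0.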

5 Experimental Evaluation 

We first describe in Section 5.1 the system setup. In Section 5.2, we assess the quality and efficiency of our rewriting technique, also with respect to traditional query reformulation approaches. We then focus on the scalability of our algorithms and their robustness with respect to network churn in Section 5.3.

Table 1: Heterogeneous scenarios used for experiments.

5.1 Experimental Setup 

Dataset and mapping generation. We have conducted our evaluation in PeerSim [22], an open source simulator for P2P protocols. In order to tune our system as well as possible, we implemented a pseudo-randomized generator of relational schemas. Indeed, none of the available relational datasets could provide us with enough heterogeneity to distribute on a large number of nodes in the network. Thanks to this generator, no peer's schema is identical to any other and, as a consequence, mappings are all distinct. Moreover, every peer has at least one acquaintance, connected to it via a mapping. This ensures that there are no semantically disconnected peers in the PDMS.

The generator leverages a dictionary of about 40 names, ranging from table names to attribute names. We have designed a total of 10 scenarios (outlined in Table 1), by varying the number of peers in the PDMS, the number of mappings and the number of acquaintances, the latter ranging from a minimum of 1 to a maximum of 21, which is compatible with the diversity of the randomized schemas.

In the above scenarios, the number of mappings from a peer to each of its acquain- 
tances ranges from 1 to 6, whereas each mapping has at most 3 atoms in the body/head. 
Moreover, each peer's schema has been randomly generated to contain at most 6 tables 
with at most 3 attributes each. The queries used in the experiments have been randomly 
generated to match the atoms in the body/head of the mappings, thus may contain in 
turn from 1 to 3 atoms. Finally, the FOAF files are initially empty in all experiments, 
and are incrementally filled, as soon as query reformulation starts. 

Qualitative measures and protocols for comparison. In each of the scenarios depicted in Table 1, one or more queries are formulated on initiating peers and they fire a certain number of relevant rewritings, which represent all the rewritings for which the AF-IMF measure is greater than 0. To evaluate the quality of the top-k mappings, we run our query reformulation algorithms in a centralized implementation of our protocol, and take the returned results for each query as relevant rewritings. We have measured the recall, which is computed as follows:

$$\text{Recall}_{\text{AF-IMF}} = \frac{\text{Number of Retrieved Relevant Rewritings}}{\text{Total Number of Relevant Rewritings}}$$
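The recall measure can be computed as in the following small sketch, where rewritings are identified by arbitrary labels. The function name and the sample labels are hypothetical, and returning 0 for an empty set of relevant rewritings is a convention chosen for this example.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant rewritings that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


print(recall({"r1", "r2", "x"}, {"r1", "r2", "r3", "r4"}))
```

Retrieved rewritings that are not relevant (such as `x` above) do not affect recall; they would only lower precision.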

Moreover, we have measured the time (in ms) taken to retrieve such relevant rewrit- 
ings. 

In order to gauge the effectiveness of our techniques and also to provide a yardstick for comparison, we have implemented the following protocols, which have been used throughout the experimental assessment:

• Full: The query gets translated against the relevant (using AF-IMF) rewriting sequences, by exploiting LSV, gossiping and FOAF links.

• Full-: The query gets translated against the relevant (using AF-IMF) rewriting sequences, by exploiting LSV and gossiping (i.e. the protocol Full without FOAF links).

• Baseline#: The query gets translated against the relevant (using AF only) rewriting sequences.

• Baseline+: The original query gets translated against the mappings found in the traversal, and all its rewritings (relevant and non relevant) get propagated.

• Baseline: The original query gets translated against the mappings found in the traversal, and gets propagated as it is.

With Baseline and Baseline+, we have reimplemented the propagation strategy of 
existing approaches [14, 6], adopting, however, the bidirectional translation semantics 
of our system. 

Initial System Setup. We have executed an initial set of experiments, aiming at determining the gossip thresholds VGossip and LGossip. The former indicates the size of the Mapping-content table in the LSV, while the latter controls the size of a gossip message within each gossip cycle. Both parameters directly impact the effectiveness of the gossip protocol, since they indicate what size an LSV and its buffer should have in order to harvest the largest amount of relevant content in the network.

From the experiments, which we omit for conciseness, we observe that a VGossip size of 500 entries is a good trade-off between the number of relevant rewritings retrieved and time, while varying the gossip cycles from 1 to 10. We also observe that, if we keep LGossip of the same size as VGossip (in other words, we disseminate the entire LSV in gossip messages) or smaller, the results in terms of rewritings are not affected much. Therefore, we opted for a value of LGossip = 100 throughout the analysis.

Moreover, as the ranking function to use in the TranslateQuery algorithm, we have adopted the harmonic mean, which outperforms the other ranking functions by 0.5% (averaged over 10 gossip cycles).

Finally, we have also empirically determined the maximum number of relevant 
requests REQS. We observed in a batch of initial experiments that the number of 
rewritings is affected by a value of REQS greater than only during the initial gossip 
cycles, whereas REQS = becomes the most preferable choice, when the number of 
gossip cycles increases. From these experiments, we could infer that REQS should 
be used as a dynamic threshold, and should have values slightly greater than when 
gossiping starts and drop to as long as gossip cycles reach 4. 
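
A dynamic threshold of this kind can be sketched as a step function of the gossip
cycle. The two threshold values are deliberately left as parameters, since they are
not fixed here; the name reqs_threshold and the 1-based cycle numbering are
assumptions of this sketch.

```python
def reqs_threshold(gossip_cycle, initial_reqs, converged_reqs, cutoff_cycle=4):
    """Return the REQS value to use at a given (1-based) gossip cycle.

    initial_reqs is used while gossiping starts; converged_reqs is used
    once the cycle count reaches cutoff_cycle.
    """
    if gossip_cycle < 1:
        raise ValueError("gossip cycles are numbered from 1")
    if gossip_cycle < cutoff_cycle:
        return initial_reqs
    return converged_reqs
```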

Also, we have set the threshold a of the number of query hops to unbounded, 
to be able to observe the behavior of our algorithms in the most general case. Our 
prototype has been implemented in Java and all experiments have been performed on a
2.7 GHz Intel Core i5 machine with 4 GB RAM, running Windows 7 and JDK 6. For all
experiments (unless otherwise specified), we have used a PDMS of 2000 peers with a 
configuration as in scenario 4 of Table 1. 

5.2 Qualitative Evaluation 

Recall and Comparison with previous approaches. As described above, we have
measured the recall of our approach and compared it with the protocols Baseline and
Baseline+. From Figure 7 (a), we can observe that our protocol Full has the greatest
recall across all the values of top-k mappings, compared with all other protocols. In
particular, the contribution of FOAF links to the recall is noticeable, since such links
connect network areas which would otherwise be unexplored in the query
translation process, as shown by the trend of Full and Full-. Baseline, Baseline+ and
Baseline^ have a lower recall, as they do not exploit the relevance measure AF-IMF, 
thus the mappings that they exploit during query translation are in most cases not
relevant. As more mappings are traversed, their recall improves, until Baseline#
reaches the same recall as Full-, while it never reaches 100% recall. The latter is only
achieved by the Full protocol, by exploiting the FOAF links. Interestingly, this exper- 
iment showed the effectiveness of AF-IMF, LSV and gossiping (from Baseline up to 
Full-) and the utility of FOAF links (from Full- to Full). In particular, it can be ob- 
served that adopting a local metric for evaluating mappings (like the AF metric of the 
Baseline^ protocol) works better than using no metric at all (Baseline and Baseline+). 
However, it performs worse than using the AF-IMF global metric (like in Full- and 
Full), which has the most desirable behavior amongst all scenarios. 

Similarly, in terms of the number of relevant rewritings, as shown in Figure 7 (b), 
the Full protocol is the one that can harvest the highest number at any value of the 
top-k mappings. Finally, we have quantified the cost incurred by the Full and Full- 
protocols with respect to the Baseline protocols. The results are reported in Figure 7
(c), which shows the time averaged over 10 queries. We can observe that the times
increase somewhat, due mainly to the active and passive gossiping threads
and to the computation of relevance for Full- and, additionally, to the FOAF linking for
Full. However, these times are still reasonable, as the latter protocols allow a significant
increase of the recall (as shown in Figure 7 (a)).

Precision of distributed IMF. Next, we conducted another experiment to gauge the 
effectiveness of our distributed technique to compute AF-IMF. We have defined the 
precision of distributed IMF as follows: 

$$\mathit{Precision}_{IMF} = \frac{\mathit{ComputedIMFValue}}{\mathit{ExpectedIMFValue}}$$



We recall that IMF only depends on the query and not on the mappings, whereas 
AF depends on the mappings. There are no false positives in the query reformulation 
process, therefore the precision of AF cannot be determined. For such a reason, the 
precision we have measured is defined on IMF. 
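
The ratio defined above can be computed as in the following minimal sketch. The
function names and the averaging over several queries are illustrative assumptions;
the expected value is taken to be the IMF computed with complete knowledge of the
network.

```python
def imf_precision(computed_imf, expected_imf):
    """Ratio between the IMF computed from the LSV and the expected IMF."""
    if expected_imf == 0:
        raise ValueError("expected IMF must be non-zero")
    return computed_imf / expected_imf


def mean_imf_precision(pairs):
    """Average precision over (computed, expected) pairs, one per query."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (computed, expected) pair is required")
    return sum(imf_precision(c, e) for c, e in pairs) / len(pairs)
```

A value of 1 means that the distributed computation matches the expected IMF
exactly.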

By measuring such precision while varying the gossip cycles, we indirectly measure
the effectiveness of the LSV. We can observe that in about 3 gossip cycles, the
number of inquiries converges to REQS = 0, meaning that the LSV has fetched
enough relevant tuples from the outside and is self-contained. The backward precision
has a similar trend, and is omitted for lack of space. 

Effectiveness of FOAF links. Figure 8 (b) shows another experiment we have con- 
ducted to gauge the increase of the number of FOAF links as the number of gossip 
cycles grows. Such increase is not affected much by the threshold REQS, thus con- 
firming that the converging value of REQS — already conveys enough FOAF links. 

Figure 7: (a) Recall, (b) # of Relevant Rewritings, (c) Total Time (ms) and (d) Network Churn
from 2000 peers down to 1100 peers; 10 queries. 



5.3 Scalability and Churn 

We now assess the robustness of our techniques in large-scale PDMS, by varying the 
number of peers (spanning all scenarios in Table 1). In Figure 8 (c) and (d), both the 
average time and # of relevant rewritings have a linear growth as the number of peers 
increases. This confirms that our techniques are scalable. 

In the next experiment, we have simulated the network churn, by starting from an 
initial configuration of 2000 peers, and forcing 100 peers at a time to leave the network. 
The aim of this experiment was twofold: to measure the robustness of our PDMS to
churn, and to show the utility of FOAF links in a situation in which acquaintances of
the peers (along with their mappings) quit the network.
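
The churn schedule of this experiment can be sketched as follows. The final size of
1100 peers is the one reported in the caption of Figure 7; the function names and the
uniform random choice of departing peers are assumptions, not a description of the
PeerSim setup.

```python
import random


def churn_schedule(initial_peers=2000, final_peers=1100, step=100):
    """Network sizes visited when `step` peers leave at a time."""
    return list(range(initial_peers, final_peers - 1, -step))


def apply_churn(peers, step, rng):
    """Return the set of peers left after `step` randomly chosen peers quit.

    Choosing the departing peers uniformly at random is an assumption.
    """
    leaving = set(rng.sample(sorted(peers), step))
    return set(peers) - leaving
```

Running apply_churn once per step of churn_schedule() reproduces the sequence of
network sizes on which the protocols are compared.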

Figure 7 (d) shows that the Full- protocol (without FOAF) gets few relevant
rewritings after a cutoff point, i.e. when the # of peers drops to 1500; indeed, the
useful acquaintances have left, and no FOAF link can be exploited to get new
rewritings. Conversely, the Full protocol degrades gracefully as the # of peers decreases
and exhibits a linearly decreasing number of rewritings, thus showing the utility of
FOAF links.

Figure 8: (a) Distributed IMF Relevance Forward Precision, (b) Impact of REQS on FOAF
Links, (c) Scalability wrt. # of Peers (Time) and (d) Scalability wrt. # of Peers (Nr. of 
Rewritings); 10 queries. 

6 Related Work 

There has been a great deal of work on data management in P2P databases on is- 
sues ranging from schema mediation [14] to mapping data values [17], query process- 
ing [19] and query translation [14, 6]. 

Kementsietsidis et al. [17] describe a set of algorithms for exchanging data among
peers, by only leveraging constraints on such exchange in the form of mapping
tables, which comprise data values of the local peer and of external peers. Constraints on
the content of peers in the form of logical rules are also studied in theoretical work
on data integration [20].

The only previous work that considered query reformulation in this context is [14, 
6]. In Piazza [14], each peer is equipped with inclusion and equality mappings and a 
set of local storage descriptions. Query answering is done by evaluating the contain- 
ment of any arbitrary external conjunctive query against the mappings and the storage 
descriptions. However, no approximation of the local peer mappings with suitable data 
structures is adopted. Moreover, [14] relies on a centralized index rather than on a
distributed one. Schema mapping and query translation frameworks for XML databases
are presented in [25, 6]; unlike this paper, they disregard the problem of ranking
mappings based on relevance.

We focus on individual rewritings in this work and adopt the query rewriting
semantics of [6, 12]. Query rewriting with respect to a set of views is addressed in
MiniCon [24], where views are joined to return the maximally contained rewritings for
LAV data integration.

Data integration in the presence of a global mediated ontology, relational data 
sources and GAV mappings is also addressed in [8]. 

Efficient XML query processing in P2P networks [19] leverages multi-level Bloom filters.
However, we are not focusing on query optimization for XML.

Finally, Kantere et al. [16] consider the problem of clustering peers based on their 
common interests in unstructured networks. Contrary to our approach, they utilize
metrics to compare a query and its rewritings that are applied after the rewritings have
been computed and not beforehand, as in our approach. Moreover, our global AF-IMF
metric is the first to take into account the entire collection of mappings in the network. 
The idea of quantifying the information transfer of individual schema mappings with 
local metrics is the subject of recent work [2]. However, no global metrics in a social 
and distributed context are considered. 

Gossiping as a means to enter diverse semantic domains is used in [1], where
mappings between peers may not be correct or may simply not be aligned with a given
domain. Therefore, the paper shows how local mappings can be used to establish a 
global semantic agreement among the peers. 

7 Conclusions and Future Work 

To conclude, in this paper we have studied the problem of semantic query reformu- 
lation in social PDMS. We have presented a new notion of relevance of a query to 
a mapping and introduced a global metric for ordering the mappings considered in 
query rewriting. Future work is devoted to studying the impact of query personalization,
the combination with other quality metrics and the extension to unions of conjunctive
queries (UCQs).